Automatic Data Extraction from Data-Rich Web Pages
نویسندگان
چکیده
Extracting data from web pages using wrappers is a fundamental problem arising in a large variety of applications of vast practical interests. In this paper, we propose a novel technique to the problem of differentiating roles of data items from Web pages, which is one of the key problems in our automatic extraction approach. The problem is resolved at various levels: semantic blocks, sections and data items, and several approaches are proposed to effectively identify the mapping between data items having the same role. Intensive experiments on real web sites show that the proposed technique can effectively help extracting desired data with high accuracies in most of the cases.
منابع مشابه
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
متن کاملUnsupervised Structured Data Extraction from Template-generated Web Pages
This paper studies structured data extraction from template-generated Web pages. Such pages contain most of structured data on the Web. Extracted structured data can be later integrated and reused in very big range of applications, such as price comparison portals, business intelligence tools, various mashups and etc. It encourages industry and academics to seek automatic solutions. To tackle t...
متن کاملExperiences regarding Automatic Data Extraction from Web Pages
Existing methods of information extraction from HTML documents include manual approach, supervised learning and automatic techniques. The manual method has high precision and recall values but it is difficult to apply it for large number of pages. Supervised learning involves human interaction to create positive and negative samples. Automatic techniques benefit from less human effort but they ...
متن کاملA Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages
World Wide Web is a vast and rapidly growing source of information. Web Pages contain a combination of unique data and template material, which is present across multiple pages to achieve high productivity of publishing. The template detection becomes a more attractive technique in the web pages, since the unknown template degrade the performance of web applications due to the irrelevant terms ...
متن کاملTowards Automatic Structured Web Data Extraction System
Automatic extraction of structured data from web pages is one of the key challenges for the Web search engines to advance into the more expressive semantic level. Here we propose a novel data extraction method, called ClustVX. It exploits visual as well as structural features of web page elements to group them into semantically similar clusters. Resulting clusters reflect the page structure and...
متن کامل